Presentation Video Link = https://binusianorg-my.sharepoint.com/personal/karina_wardoyo_binus_ac_id/_layouts/15/guestaccess.aspx?share=Eu9ZPmcecUNOl4Ch_qA2X9YBo3SyreCdMsLYkp0ZlNAnnw&e=CdiaCa
## age job marital education default housing loan contact
## 0 83 retired divorced basic.4y no no no cellular
## 1 32 services married high.school no no no telephone
## 2 31 admin. single university.degree no yes no telephone
## 3 33 admin. single university.degree no yes no telephone
## 4 39 services married high.school no yes no telephone
## 5 57 management married university.degree no no no cellular
## month day_of_week duration campaign pdays previous poutcome y
## 0 nov tue 242 1 3 3 success yes
## 1 may tue 190 3 999 0 nonexistent no
## 2 may wed 325 2 999 0 nonexistent no
## 3 aug thu 158 1 999 0 nonexistent no
## 4 jul mon 158 1 999 0 nonexistent no
## 5 jul tue 101 1 999 0 nonexistent no
Reading the CSV, also making the 1st column became the index for rows using row.names=1 but after that column ‘X’ will no longer show as column but became index, after that display the first 6 rows of data
## age job marital education default housing
## 0 0 0 0 0 0
## loan contact month day_of_week duration campaign
## 0 0 0 0 12 0
## pdays previous poutcome y
## 0 0 0 0
Based on what we checked there is only 1 column with missing values, that is the duration column with 12 missing values.
## [1] 0
Based on what we checked there is no duplicated data that exist in the dataset.
##
## Column: job
##
## admin. blue-collar entrepreneur housemaid
## 8 2033 1870 294 219
## management retired self-employed services student
## 546 366 263 822 184
## technician unemployed unknown
## 1372 196 65
##
## Column: marital
##
## divorced married single unknown
## 892 4994 2338 14
##
## Column: education
##
## basic.4y basic.6y basic.9y high.school
## 817 486 1243 1915
## illiterate professional.course university.degree unknown
## 6 1043 2379 349
##
## Column: default
##
## no unknown
## 6491 1747
##
## Column: housing
##
## no unknown yes
## 3749 205 4284
##
## Column: loan
##
## no unknown yes
## 6741 205 1292
##
## Column: contact
##
## cellular telephone
## 5232 3006
##
## Column: month
##
## apr aug dec jul jun mar may nov oct sep
## 529 1189 38 1481 1074 105 2729 827 149 117
##
## Column: day_of_week
##
## fri mon thu tue wed
## 1524 1676 1769 1599 1670
##
## Column: poutcome
##
## failure nonexistent success
## 836 7136 266
##
## Column: y
##
## no yes
## 7329 909
Based on what we checked in categorical columns there is a few unknown categories and a few blank values such as ” ” in job column and “unknown” in job, marital, education, housing and loan columns.
## age job marital education default housing
## TRUE FALSE FALSE FALSE FALSE FALSE
## loan contact month day_of_week duration campaign
## FALSE FALSE FALSE FALSE TRUE TRUE
## pdays previous poutcome y
## TRUE TRUE FALSE FALSE
Based on what we checked the numerical columns are age, duration, campaign, pdays, previous. So, we will only make the boxplots of these columns for checking outliers and extreme values, except the pdays because most of the datas is on 999 which means not previously contacted
Based on the boxplot the biggest outlier is age 89 but technically it is still possible for someone to be 89 years old, so the outliers in age column is considered valid outliers.
Based on the boxplot the biggest outlier is 971 seconds and above that is considered extreme values, the most extreme value is on 4918 seconds.
Based on the boxplot the biggest outlier is 9 contacts made, and it is still not considered extreme, and data above 9 is considered extreme, with the most extreme is 43 contacts made
Based on the boxplot the biggest extreme values is 6 contacts made, and no outliers were identified so anything above 0 means extreme values
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 4 4 13 87 29 7 97 11 1 7 7 1 4 7 4 4
## 17 19 25 26 999
## 1 1 1 1 7947
The irrelevant values aren’t the exact data value, irrelevant values are sometimes considered as a value that symbolises something, in this case 999 in the column pdays means not previously contacted. The 999 can actually affect the mean and median of this column.
Make a duplicate of df and called clean_data the clean_data is the dataset that we will modify and make it cleaner and ready to get processed
## Missing values in duration after imputation: 0
## Median used for imputation: 180 seconds
Searching the median for duration column which is 180 and all missing values has been replaced with the median value which is 180, this makes there are no longer missing value in duration column. Also, because duration is a numerical column it is better to fill the missing value with median because it is more robust to outliers and extreme values.
## age job marital education default housing
## 0 0 0 0 0 0
## loan contact month day_of_week duration campaign
## 0 0 0 0 0 0
## pdays previous poutcome y
## 0 0 0 0
If we check again for the missing values in df but now is copied and modified to clean_data, the duration shows that it has 0 missing values. Also, from this result this indicates that none of the column has missing values.
## 'Unknown' and blank values are replaced with most frequent values (mode) appeared in their respective category, the modes are like the following:
## Job: admin.
## Marital: married
## Education: university.degree
## Housing: yes
## Loan: no
Also, for the categorical data, I chose to fill the blanks and unknowns with the mode of the columns, this shows that I am assuming the unknowns and the blanks has the same value as the majority values, but in here the column ‘default’ has the answer ‘unknown’ but since unknown here has meaning that is why I didn’t convert it to the mode of the default column, but just let it be.
##
## Column: job
##
## admin. blue-collar entrepreneur housemaid management
## 2106 1870 294 219 546
## retired self-employed services student technician
## 366 263 822 184 1372
## unemployed
## 196
##
## Column: marital
##
## divorced married single
## 892 5008 2338
##
## Column: education
##
## basic.4y basic.6y basic.9y high.school
## 817 486 1243 1915
## illiterate professional.course university.degree
## 6 1043 2728
##
## Column: default
##
## no unknown
## 6491 1747
##
## Column: housing
##
## no yes
## 3749 4489
##
## Column: loan
##
## no yes
## 6946 1292
##
## Column: contact
##
## cellular telephone
## 5232 3006
##
## Column: month
##
## apr aug dec jul jun mar may nov oct sep
## 529 1189 38 1481 1074 105 2729 827 149 117
##
## Column: day_of_week
##
## fri mon thu tue wed
## 1524 1676 1769 1599 1670
##
## Column: poutcome
##
## failure nonexistent success
## 836 7136 266
##
## Column: y
##
## no yes
## 7329 909
If we check again for unknown in job, marital, education, housing, loan or ” ” in job it will no longer exist because it has been replaced with the mode of each respective columns.
| Variable | Mean | Median | SD | IQR | Min | Max |
|---|---|---|---|---|---|---|
| age | 39.93 | 38 | 10.51 | 15 | 17 | 89 |
| duration | 258.07 | 180 | 266.56 | 217 | 4 | 4918 |
| campaign | 2.57 | 2 | 2.79 | 2 | 1 | 43 |
| pdays | 963.90 | 999 | 183.41 | 0 | 0 | 999 |
| previous | 0.17 | 0 | 0.49 | 0 | 0 | 6 |
| job | Count | Mean_Age | Median_Age | SD_Age | Min_Age | Max_Age |
|---|---|---|---|---|---|---|
| admin. | 2106 | 38.4 | 36 | 9.1 | 21 | 70 |
| blue-collar | 1870 | 39.3 | 38 | 9.0 | 20 | 80 |
| technician | 1372 | 38.4 | 36 | 8.5 | 22 | 70 |
| services | 822 | 37.6 | 36 | 9.3 | 20 | 60 |
| management | 546 | 42.4 | 41 | 9.3 | 23 | 77 |
| retired | 366 | 61.8 | 59 | 10.5 | 26 | 89 |
| entrepreneur | 294 | 40.7 | 39 | 9.0 | 20 | 60 |
| self-employed | 263 | 39.9 | 39 | 9.4 | 23 | 60 |
| housemaid | 219 | 45.8 | 45 | 10.6 | 23 | 82 |
| unemployed | 196 | 39.8 | 39 | 8.9 | 23 | 66 |
| student | 184 | 26.0 | 26 | 4.9 | 17 | 44 |
| y | Count | Percentage | Avg_Age | Avg_Duration | Avg_Campaign |
|---|---|---|---|---|---|
| no | 7329 | 89 | 39.8 | 221.5 | 2.6 |
| yes | 909 | 11 | 40.8 | 552.8 | 2.0 |
Above is the summary table of campaign outcome distribution, most of the campaign didn’t come to a success since a lot of the clients rejected or doesn’t subscribe to a term deposit. The one that rejects has the average age of ~40 and it took an average of 3 contacts to make them reject and those who accepted has the average age of ~41 and it took them an average of 2 contacts to be convinced.
| Variable | Pearson | Spearman |
|---|---|---|
| age | 0.029 | -0.015 |
| duration | 0.389 | 0.339 |
| campaign | -0.069 | -0.066 |
| previous | 0.219 | 0.184 |
| y | Count | Mean_Duration | Median_Duration | SD_Duration | Min_Duration | Max_Duration |
|---|---|---|---|---|---|---|
| no | 7329 | 221.5 | 165 | 210.8 | 4 | 4918 |
| yes | 909 | 552.8 | 439 | 433.7 | 63 | 4199 |
| job | no | yes | Total | SuccessRate |
|---|---|---|---|---|
| student | 130 | 54 | 184 | 29.3 |
| retired | 276 | 90 | 366 | 24.6 |
| unemployed | 171 | 25 | 196 | 12.8 |
| admin. | 1844 | 262 | 2106 | 12.4 |
| self-employed | 234 | 29 | 263 | 11.0 |
| technician | 1222 | 150 | 1372 | 10.9 |
| management | 488 | 58 | 546 | 10.6 |
| housemaid | 196 | 23 | 219 | 10.5 |
| services | 749 | 73 | 822 | 8.9 |
| entrepreneur | 268 | 26 | 294 | 8.8 |
| blue-collar | 1751 | 119 | 1870 | 6.4 |
| poutcome | no | yes | Total | Success_Rate |
|---|---|---|---|---|
| success | 98 | 168 | 266 | 63.2 |
| failure | 728 | 108 | 836 | 12.9 |
| nonexistent | 6503 | 633 | 7136 | 8.9 |
This table shows how past campaign results can actually predict current success. Clients with previous successful campaigns have a 63.2% success rate in the current campaign, which is dramatically higher than those with previous outcome being failures (has a 12.9% success rate) or no previous contact has ever been made before and this is their first campaign with the success rate of 8.9%. This indicates that clients who responded positively before are much more likely to respond positively again, suggesting strong customer loyalty and satisfaction with the bank’s services. The pattern indicates that it is important to stay in good terms with previously successfully convinced client.
| Age_Dist | Count |
|---|---|
| <30 | 1187 |
| >50 | 1443 |
| 30-50 | 5608 |
| Recency_Dist | Count |
|---|---|
| never | 7947 |
| old | 2 |
| recent | 289 |
| Total_Contact_Score | Count |
|---|---|
| 1 | 3038 |
| 2 | 2188 |
| 3 | 1285 |
| 4 | 633 |
| 5 | 351 |
| 6 | 226 |
| 7 | 155 |
| 8 | 75 |
| 9 | 55 |
| 10 | 55 |
| 11 | 35 |
| 12 | 19 |
| 13 | 17 |
| 14 | 17 |
| 15 | 12 |
| 16 | 13 |
| 17 | 15 |
| 18 | 11 |
| 19 | 2 |
| 20 | 4 |
| 21 | 6 |
| 22 | 3 |
| 23 | 1 |
| 24 | 2 |
| 25 | 2 |
| 26 | 2 |
| 27 | 4 |
| 29 | 3 |
| 30 | 2 |
| 31 | 3 |
| 33 | 1 |
| 35 | 2 |
| 43 | 1 |
The total contact score distribution shows that most clients, 3.038 clients in total, have a score of 1, meaning they were contacted only once across all campaigns. The frequency decreases as the contact score increases, with 2.188 clients having a score of 2 contacts, and 1.285 clients having 3 contacts made. Very few clients have high contact value, with only 1 client having the maximum value of contact made that is on 43 contacts. This indicates that the bank typically makes minimal contact attempts with most clients, and follow-up contact is rarely done.
For interactive visualization between 2 categorical variable, I chose to use the same columns as I did for the variable correlation analysis in 2a, which are job and y. Here the bar chart compares job types with campaign outcomes, sorted by total client count from lowest to highest (so the chart is prettier and easier to read). Admin workers show the highest numbers for both ‘yes’ and ‘no’. Students and retired clients, even when they have smaller sample sizes, appear to have relatively higher acceptance rates compared to other columns, retired clients have 90 ‘yes’ which is on the 4th place if we sort the frequency of ‘yes’ from highest to lowest, and students are in 7th place with n=54 which means the count for ‘yes’ is 54. The visualization clearly shows that while admin and blue-collar workers make up the majority of contacts, their success rates are low compared to students. Like we analyzed before in correlation, student has 29.3% success rate while admin has 12.4% success rate and blue-collar has 6.4% success rate.
For the categorical and numerical variable visualization, I chose y and age. This interactive boxplot shows the age distribution for each current campaign outcome, which is yes and no. The successful and unsuccessful campaign outcome show similar median ages around late 30 age, those who said ‘no’ has the median of 38 years old and those who says ‘yes’ has the median of 36 years old, with the successful group having a slightly lower median age. The interquartile ranges are also quite similar between the two groups, suggesting that age alone is not be a strong predictor of whether the campaign is successful or not. But, the successful group shows a slightly better distribution with fewer outliers.
This multi-dimensional bar plot shows current campaign success rates grouped by age group column that we made in 2b, also in each of the age group it has more category inside which is the education level column again by education level. The visualization reveals how education level influences campaign success within different age groups. Higher education levels, particularly university degrees, professional courses and high school, consistently show better success rates across all age groups, with the youngest group (under 30 years old) achieving the highest success rates among all age group with the success rate of their university degree clients. The middle-aged group (30-50 years old) shows okay success rates with university degree and it has the highest success rate among the age group but the university degree on this age group is the lowest among other age group, while the oldest group (above 50 years old) demonstrates the a moderate success rates across all education levels, with the highest success rate is with clients with a basic 4y education and the lowest success rate is clients with basic 9y education level. This pattern means that both younger age and higher education can impact campaign acceptance significantly/positively, though the >50 age group shows a different education preference pattern compared to younger groups with the highest success rate is those who have basic 4y education level, because maybe the era is different too in terms of education.
This chart reveals that campaign success rates will increase with longer call duration, because the lowest success rate is 0.7% and it’s for very short calls under 100 seconds then it grew to a success rate of 47.3% for calls lasting more than 600 seconds. Even the difference between medium and long duration is pretty significant with almost 30%. So the possible correlation here is because genuinely engaged customers will naturally comply to participate in longer discussions and that is where bank representatives can properly explain benefits and address concerns of the campaign they are doing currently.
First, implement minimum call duration standards of 300 seconds (5 minutes), as calls under this time cap have less than 8% success probability and this will mean the effort that the bank representative has done is wasted. Second, teach agents to slow down and have real conversations with customers instead of rushing through calls, because taking more time leads to much better results. Bank representatives then can explain benefits clearly, and take time to answer questions, since our data proves that longer conversations of 10+ minutes are much more successful than rushing through quick calls.
This chart reveals that customer loyalty from previous campaigns is the strongest predictor of current campaign success, showing a dramatic 7x difference between satisfied and new customers. First-time contacts achieve only 8.9% success rate, previous failures show slight improvement at 12.9%, but customers with previous successful experiences shows loyalty with the bank with 63.2% success rate. The massive gap between previous success (63.2%) and first-time contacts (8.9%) and previous failure (12.9%) indicates that satisfied customers are willing to engage in future offerings. In this case, the clients who previously accepted the campaign are more likely to accept the current campaign because maybe the bank really does give a good benefit for it. This pattern suggests that maintaining good relationships with existing successful customers is effectively impacting the future campaign outcomes too, this might be better than searching new prospects.
First, prioritize campaigns targeting customers with previous successful outcomes, since they have a 63.2% success rate compared to only 8.9% for new contacts, this means focusing on satisfied customers is way more effective than contacting new prospects. Second, create a customer loyalty program to keep successful customers engaged between campaigns, because the data proves that once customers trust the bank and have positive experiences, they become highly likely to accept future offers. Bank representatives should maintain regular contact with these valuable customers through follow-up calls or special offers to make them want to remain there being satisfied and ready for the next campaign.